@michaelr524
Collaborator

@michaelr524 michaelr524 commented Jan 3, 2026

PR Type

Enhancement


Description

  • Fallacy checker refactored with profile-based configuration, single-pass full-document extraction, and comprehensive telemetry tracking via PipelineTelemetry class

  • Multi-extractor support with LLM judge for issue aggregation, deduplication, and configurable filter chain (principle-of-charity, supported-elsewhere)

  • OpenRouter client refactored to direct HTTP API with reasoning budget support, temperature normalization across providers, and unified usage metrics

  • Fallacy extractor enhanced with multi-model support (Claude and OpenRouter), configurable parameters, date context injection, and telemetry capture

  • New fallacy judge tool for aggregating and deduplicating issues from multiple extractors with decision logic (accept/merge/reject)

  • Reasoning budget resolver for OpenRouter models with caching, provider-specific limits, and client-safe UI display formatting

  • Validation framework with comparison logic, regression detection, baseline management, and corpus document tracking in MetaEvaluationRepository

  • Job orchestrator updated with profile support, improved type safety, and pipelineTelemetry persistence

  • Plugin manager enhanced with profile configuration and telemetry collection from plugins

  • New filter tools for principle-of-charity and supported-elsewhere evaluation using LLM-based filtering

  • Unified LLM filter utilities abstracting Claude and OpenRouter API differences with model detection and reasoning configuration

  • Lab validation feature with TypeScript types, API endpoints for runs/baselines/profiles, and UI hooks for validation management

  • Model discovery utilities for fetching and caching available models from Anthropic and OpenRouter APIs

  • Fuzzy deduplication strategies for comparing extraction issues with multiple similarity algorithms


Diagram Walkthrough

flowchart LR
  A["Fallacy Checker Plugin"] -->|"profiles"| B["Profile Loader"]
  A -->|"multi-extract"| C["Multi-Extractor"]
  C -->|"parallel"| D["Fallacy Extractors<br/>Claude/OpenRouter"]
  C -->|"aggregate"| E["Fallacy Judge"]
  A -->|"filter chain"| F["Filter Tools"]
  F -->|"principle-of-charity"| G["Charity Filter"]
  F -->|"supported-elsewhere"| H["Support Filter"]
  G -->|"LLM calls"| I["LLM Filter Utils"]
  H -->|"LLM calls"| I
  D -->|"LLM calls"| J["Claude Wrapper<br/>OpenRouter Client"]
  E -->|"LLM calls"| J
  J -->|"reasoning budget"| K["Reasoning Budget<br/>Resolver"]
  A -->|"telemetry"| L["Pipeline Telemetry"]
  L -->|"metrics"| M["Job Orchestrator"]
  M -->|"save results"| N["Database"]
  N -->|"validation"| O["Validation Framework"]
  O -->|"compare"| P["Comparison Logic"]

File Walkthrough

Relevant files
Enhancement
27 files
index.ts
Fallacy checker refactored with profiles, telemetry, and
multi-extractor support

internal-packages/ai/src/analysis-plugins/plugins/fallacy-check/index.ts

  • Refactored to support profile-based configuration with
    FallacyCheckPluginOptions for flexible profile loading
  • Implemented single-pass full-document extraction instead of
    chunk-based processing for better context
  • Added comprehensive telemetry tracking via PipelineTelemetry class
    capturing all pipeline stages
  • Introduced configurable filter chain (principle-of-charity,
    supported-elsewhere) with dynamic dispatch
  • Integrated multi-extractor support with LLM judge for issue
    aggregation and deduplication
  • Added helper methods for resolving thinking/reasoning configuration
    across different model types
+855/-186
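
The configurable filter chain with dynamic dispatch described above can be pictured as a lookup from filter name to filter function. This is a minimal sketch, not the plugin's actual code: the `Issue` shape, the `FILTERS` registry, and `runFilterChain` are hypothetical, and the two filters are synchronous stand-ins for what the PR describes as LLM-based filters.

```typescript
// Hypothetical shapes for illustration only.
interface Issue {
  text: string;
  confidence: number;
}

type FilterFn = (issues: Issue[]) => Issue[];

// Stand-in filters; the real ones call an LLM and would be asynchronous.
const FILTERS: Record<string, FilterFn> = {
  "principle-of-charity": (issues) => issues.filter((i) => i.confidence >= 0.5),
  "supported-elsewhere": (issues) =>
    issues.filter((i) => i.text.indexOf("[supported]") === -1),
};

// Runs each named filter in order; unknown names are skipped rather than throwing.
function runFilterChain(chain: string[], issues: Issue[]): Issue[] {
  let current = issues;
  for (const name of chain) {
    const filter = FILTERS[name];
    if (filter !== undefined) {
      current = filter(current);
    }
  }
  return current;
}
```

Keeping the chain as a list of names means a profile can reorder or drop filters without code changes, which is presumably the point of making it configurable.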
openrouter.ts
OpenRouter client refactored to direct HTTP with reasoning budget
support

internal-packages/ai/src/utils/openrouter.ts

  • Replaced OpenAI SDK wrapper with direct HTTP API client for full
    control over OpenRouter-specific parameters
  • Added comprehensive type definitions for OpenRouter request/response
    structures and reasoning configuration
  • Implemented callOpenRouter() low-level API, callOpenRouterChat() for
    simple completions, and callOpenRouterWithTool() for tool calling
  • Added unified usage metrics integration capturing cost, cache tokens,
    and reasoning tokens
  • Implemented temperature normalization across different provider ranges
    (Anthropic, OpenAI, Google, etc.)
  • Added reasoning budget resolution with provider-specific limits and
    explicit token budget support
+668/-31
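
The temperature normalization mentioned above can be sketched as rescaling one user-facing value into each provider's native range. The ranges below come from a commit note on this PR (Anthropic 0-1, others 0-2); the 0-1 user-facing scale, the `provider/model` ID convention, and the names `detectProvider` and `normalizeTemperature` are assumptions made for illustration.

```typescript
// Hypothetical names throughout; the PR's actual helpers are not shown here.
type TemperatureRange = { min: number; max: number };

// Ranges as described in the PR's commit notes: Anthropic 0-1, others 0-2.
const ANTHROPIC_RANGE: TemperatureRange = { min: 0, max: 1 };
const DEFAULT_RANGE: TemperatureRange = { min: 0, max: 2 };

// Assumes OpenRouter-style IDs of the form "provider/model-name".
function detectProvider(modelId: string): string {
  const slash = modelId.indexOf("/");
  return slash === -1 ? "unknown" : modelId.slice(0, slash);
}

// Maps a user-facing temperature in [0, 1] onto the provider's native range.
function normalizeTemperature(userTemp: number, modelId: string): number {
  const clamped = Math.min(1, Math.max(0, userTemp));
  const range =
    detectProvider(modelId) === "anthropic" ? ANTHROPIC_RANGE : DEFAULT_RANGE;
  return range.min + clamped * (range.max - range.min);
}
```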
index.ts
Fallacy extractor enhanced with multi-model support and telemetry

internal-packages/ai/src/tools/fallacy-extractor/index.ts

  • Added support for both Claude and OpenRouter models via conditional
    dispatch based on model ID format
  • Implemented configurable extraction parameters (model, temperature,
    thinking, custom prompts, thresholds)
  • Added unified usage metrics and actual API params capture for
    telemetry
  • Refactored to support single-pass full-document mode in addition to
    chunk-based extraction
  • Integrated date context injection to prevent false positives on recent
    dates
  • Moved system/user prompts to separate prompts.ts file for better
    maintainability
+263/-216
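
Two of the points above, dispatch by model ID format and date context injection, are small enough to sketch. The rule used here (IDs containing `/` go to OpenRouter, bare IDs go to Claude) and the wording of the date preamble are assumptions; `selectBackend` and `buildDateContext` are hypothetical names.

```typescript
type Backend = "claude" | "openrouter";

// Assumption: OpenRouter IDs look like "provider/model"; bare IDs go to Claude.
function selectBackend(modelId: string): Backend {
  return modelId.indexOf("/") === -1 ? "claude" : "openrouter";
}

// Builds a short preamble stating today's date (UTC) so the model does not
// treat recent dates in the document as being in the future.
function buildDateContext(now: Date): string {
  const isoDate = now.toISOString().slice(0, 10); // YYYY-MM-DD
  return `Today's date is ${isoDate}. Dates on or before this are not in the future.`;
}
```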
index.ts
New fallacy judge tool for multi-extractor issue aggregation

internal-packages/ai/src/tools/fallacy-judge/index.ts

  • New tool for aggregating and deduplicating issues from multiple
    extractors using LLM judge
  • Implements decision logic: accept (single/multi-source), merge
    (duplicates), reject (low-confidence)
  • Supports both Claude and OpenRouter models with configurable reasoning
    effort
  • Includes environment variable parsing for judge configuration
    (FALLACY_JUDGE, FALLACY_JUDGES)
  • Captures unified usage metrics and actual API parameters for cost
    tracking
  • Provides judge label generation and reasoning display utilities
+636/-0 
reasoningBudget-client.ts
New client-safe reasoning budget resolver for UI display 

internal-packages/ai/src/utils/reasoningBudget-client.ts

  • New client-safe synchronous version of reasoning budget resolver for
    UI components
  • Calculates reasoning token budgets based on effort level and
    provider-specific max completion tokens
  • Implements dynamic output reserve calculation to ensure sufficient
    tokens for tool responses
  • Provides display-friendly budget formatting (e.g., "12.5K") for
    user-facing UI
  • Supports explicit budget (max_tokens) vs effort-based reasoning
    configuration
+239/-0 
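
The budget calculation and display formatting described above might look roughly like the following. The effort percentages are placeholders, not the PR's real values, and `resolveReasoningBudget` and `formatBudget` are hypothetical names; only the "12.5K" display style comes from the description.

```typescript
type Effort = "low" | "medium" | "high";

// Placeholder percentages of the completion limit; not the PR's real values.
const EFFORT_PERCENT: Record<Effort, number> = { low: 20, medium: 50, high: 80 };

// Caps the effort-based budget so that outputReserve tokens remain for the answer.
function resolveReasoningBudget(
  effort: Effort,
  maxCompletionTokens: number,
  outputReserve: number
): number {
  const byEffort = Math.floor((maxCompletionTokens * EFFORT_PERCENT[effort]) / 100);
  const afterReserve = Math.max(0, maxCompletionTokens - outputReserve);
  return Math.min(byEffort, afterReserve);
}

// Formats token counts for display: 999 stays "999", 12500 becomes "12.5K".
function formatBudget(tokens: number): string {
  if (tokens < 1000) return String(tokens);
  return `${(tokens / 1000).toFixed(1).replace(/\.0$/, "")}K`;
}
```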
MetaEvaluationRepository.ts
Validation framework and baseline management repository methods

internal-packages/db/src/repositories/MetaEvaluationRepository.ts

  • Added deleteSeries() method to delete a series and all its associated
    runs with proper foreign key constraint handling
  • Added comprehensive validation framework methods:
    getValidationCorpusDocuments(), getEvaluationSnapshots(),
    getEvaluationSnapshotById() for retrieving evaluation data
  • Added validation baseline management methods:
    createValidationBaseline(), getValidationBaselines(),
    getBaselineSnapshots(), deleteValidationBaseline(),
    getBaselineDocumentIds()
  • Added validation run tracking methods: createValidationRun(),
    updateValidationRunStatus(), addValidationRunSnapshot(),
    getValidationRuns(), getValidationRunDetail(), deleteValidationRun(),
    getBaselineSnapshotByDocument()
  • Replaced logical OR (||) with nullish coalescing (??) for proper
    null/undefined handling in firstRunAt and lastRunAt calculations
  • Removed unnecessary null checks for docVersion with explanatory
    comment about TypeScript type guarantees
  • Enhanced getRecentDocuments() with optional titleFilter parameter for
    case-insensitive title search and increased result limit from 30 to
    100
+701/-7 
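
The difference between `||` and `??` that motivates the change above is easy to show in isolation. The helpers below are hypothetical and use plain numbers; whether falsy-but-valid values actually occur in `firstRunAt` and `lastRunAt` is not stated in the PR.

```typescript
// Hypothetical helpers contrasting the two operators on a nullable number.
function withLogicalOr(value: number | null | undefined, fallback: number): number {
  return value || fallback; // falls back on 0 and NaN as well as null/undefined
}

function withNullish(value: number | null | undefined, fallback: number): number {
  return value ?? fallback; // falls back only on null/undefined
}
```

With `||`, a legitimate `0` is silently replaced by the fallback; with `??` it is preserved.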
profile-loader.ts
Fallacy checker profile loading and validation framework 

internal-packages/ai/src/analysis-plugins/plugins/fallacy-check/profile-loader.ts

  • New file implementing profile loading and validation for fallacy
    checker configurations
  • Provides functions to load profiles by ID, load default profiles for
    agents, and fall back to defaults on errors
  • Implements comprehensive validation and merging of profile
    configurations with defaults
  • Includes profile CRUD operations: createProfile(), updateProfile(),
    deleteProfile()
  • Validates model configurations, thresholds, prompts, filter chains,
    reasoning settings, and provider preferences
  • Handles migration from old filter chain format to new format
+478/-0 
llm-filter-utils.ts
Unified LLM filter utilities for Claude and OpenRouter APIs

internal-packages/ai/src/tools/shared/llm-filter-utils.ts

  • New file providing shared utilities for LLM-based filter operations
  • Abstracts differences between Claude API and OpenRouter API calls with
    unified interface
  • Implements model detection, reasoning configuration building, and
    thinking/reasoning parameter conversion
  • Provides callLLMFilter() main function for unified LLM calls with tool
    use support
  • Includes document truncation utilities for context management and date
    context generation to prevent temporal reasoning errors
  • Exports types for reasoning config, provider preferences, API
    parameters, and response metrics
+427/-0 
compare.ts
Validation comparison and regression detection logic         

meta-evals/src/validation/compare.ts

  • New file implementing comparison logic for validation framework
  • Implements string similarity calculation using Levenshtein distance
    for fuzzy matching
  • Provides comment matching between baseline and current snapshots with
    confidence scoring
  • Implements regression detection for score drops, lost comments,
    high-importance comment loss, and telemetry anomalies
  • Includes telemetry extraction and analysis with thresholds for
    extraction drops and duration spikes
  • Provides formatting utilities for comparison results and status
    determination
+389/-0 
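
For reference, a Levenshtein-based string similarity of the kind described above can be written as follows. This is the textbook two-row dynamic-programming formulation with similarity defined as one minus distance over the longer length; the normalization and the function names are assumptions, since the PR does not show its exact formula.

```typescript
// Edit distance via two-row dynamic programming.
function levenshtein(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      // Indices are in range by construction, hence the non-null assertions.
      curr.push(Math.min(prev[j]! + 1, curr[j - 1]! + 1, prev[j - 1]! + cost));
    }
    prev = curr;
  }
  return prev[b.length]!;
}

// Similarity in [0, 1]; two empty strings count as identical.
function similarity(a: string, b: string): number {
  const maxLen = Math.max(a.length, b.length);
  if (maxLen === 0) return 1;
  return 1 - levenshtein(a, b) / maxLen;
}
```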
reasoningBudget.ts
Reasoning budget calculation and resolution for OpenRouter

internal-packages/ai/src/utils/reasoningBudget.ts

  • New file implementing reasoning budget resolver for OpenRouter models
  • Calculates optimal reasoning token budgets based on effort levels and
    provider-specific limits
  • Implements caching mechanism for model endpoint data with TTL-based
    invalidation
  • Provides both async and synchronous budget resolution functions
  • Handles model-specific API compatibility (explicit budget vs
    effort-based reasoning)
  • Includes dynamic output reserve calculation to ensure sufficient
    tokens for tool responses
+399/-0 
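
The TTL-based cache for model endpoint data mentioned above could be as simple as the sketch below. `TtlCache` is a hypothetical class, and passing the clock in as an argument is a choice made here to keep expiry testable, not something the PR describes.

```typescript
// Hypothetical cache; the current time is passed in so expiry is easy to test.
class TtlCache<V> {
  private readonly ttlMs: number;
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  set(key: string, value: V, nowMs: number): void {
    this.entries.set(key, { value, expiresAt: nowMs + this.ttlMs });
  }

  // Returns undefined for missing or expired entries and evicts expired ones.
  get(key: string, nowMs: number): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (nowMs >= entry.expiresAt) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }
}
```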
JobOrchestrator.ts
Job orchestrator profile support and type safety improvements

internal-packages/jobs/src/core/JobOrchestrator.ts

  • Added JobProcessingOptions interface with optional profileId for
    plugin configuration
  • Updated processJob() signature to accept optional options parameter
    for profile ID passing
  • Changed setupSessionTracking() from async to synchronous and removed
    null checks that TypeScript's types already rule out
  • Removed unnecessary null checks in prepareJobData() with explanatory
    comments about type guarantees
  • Updated executeAnalysis() to use options-based signature for passing
    profileId to analyzeDocument()
  • Changed saveAnalysisResults() to properly type analysisResult as
    DocumentAnalysisResult and save pipelineTelemetry
  • Improved saveHighlights() to properly type comments and remove
    redundant null checks
  • Replaced logical OR (||) with nullish coalescing (??) for proper
    null/undefined handling in comment field assignments
  • Updated logging to include profile ID information when processing jobs
+90/-93 
multiExtractor.ts
Multi-extractor parallel execution and deduplication         

internal-packages/ai/src/analysis-plugins/plugins/fallacy-check/extraction/multiExtractor.ts

  • New file implementing multi-extractor runner for parallel fallacy
    extraction
  • Provides runMultiExtractor() to execute multiple extractors in
    parallel with aggregated results
  • Implements reasoning configuration resolution from profile settings
    with backward compatibility for legacy thinking boolean
  • Provides issue deduplication using Jaccard word-overlap similarity
    with quality-based duplicate resolution
  • Includes extractor result flattening and quality scoring for extracted
    issues
  • Implements comprehensive logging and error handling for parallel
    extraction operations
+359/-0 
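
Jaccard word-overlap deduplication with quality-based resolution, as described above, can be sketched like this. The `ExtractedIssue` shape, the single `quality` number, and the tokenization on non-word characters are assumptions; the PR does not show how quality is scored or which threshold it uses.

```typescript
// Hypothetical issue shape; quality is assumed to be a precomputed score.
interface ExtractedIssue {
  text: string;
  quality: number;
}

function wordSet(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/\W+/).filter((w) => w.length > 0));
}

// |A ∩ B| / |A ∪ B| over lowercase word sets.
function jaccard(a: string, b: string): number {
  const sa = wordSet(a);
  const sb = wordSet(b);
  if (sa.size === 0 && sb.size === 0) return 1;
  let intersection = 0;
  sa.forEach((w) => {
    if (sb.has(w)) intersection++;
  });
  return intersection / (sa.size + sb.size - intersection);
}

// Keeps one issue per near-duplicate group, preferring the higher quality.
function dedupeIssues(issues: ExtractedIssue[], threshold: number): ExtractedIssue[] {
  const kept: ExtractedIssue[] = [];
  for (const issue of issues) {
    const dupIndex = kept.findIndex((k) => jaccard(k.text, issue.text) >= threshold);
    if (dupIndex === -1) {
      kept.push(issue);
      continue;
    }
    const existing = kept[dupIndex];
    if (existing !== undefined && issue.quality > existing.quality) {
      kept[dupIndex] = issue;
    }
  }
  return kept;
}
```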
wrapper.ts
Claude API extended thinking and telemetry support             

internal-packages/ai/src/claude/wrapper.ts

  • Added ThinkingConfig interface for extended thinking configuration
    with budget tokens
  • Enhanced ClaudeCallOptions with thinking parameter supporting boolean
    or ThinkingConfig object
  • Added ClaudeActualParams, ClaudeResponseMetrics interfaces for
    telemetry tracking
  • Implemented extended thinking support with automatic temperature
    adjustment (must be 1 when thinking enabled)
  • Added response metrics collection including latency, token usage,
    cache metrics, and stop reason
  • Implemented unified usage metrics calculation with cost estimation
  • Updated callClaudeWithTool() to use tool_choice: 'auto' when thinking
    is enabled (extended thinking is incompatible with forced tool choice)
  • Improved error handling with enhanced max_tokens truncation detection
  • Changed from deprecated withRetry to inline retry logic with
    exponential backoff
  • Added comprehensive telemetry capture for API calls and responses
+158/-28
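
The two thinking-related rules above (temperature forced to 1, and `tool_choice` switched to `auto`) can be expressed as a small parameter resolver. The option shapes, the `budgetTokens` field name, the default budget, and the default temperature of 0 are all assumptions; only the output fields `thinking`, `budget_tokens`, and `tool_choice` follow the Anthropic Messages API.

```typescript
// Hypothetical option shapes; the wrapper's real interfaces are not shown in the PR text.
interface ThinkingConfig {
  budgetTokens: number;
}

interface CallOptions {
  toolName: string;
  temperature?: number;
  thinking?: boolean | ThinkingConfig;
}

interface ResolvedParams {
  temperature: number;
  thinking?: { type: "enabled"; budget_tokens: number };
  tool_choice: { type: "auto" } | { type: "tool"; name: string };
}

const DEFAULT_THINKING_BUDGET = 4096; // placeholder default, not from the PR

function resolveParams(options: CallOptions): ResolvedParams {
  const thinking = options.thinking;
  if (thinking === undefined || thinking === false) {
    return {
      temperature: options.temperature ?? 0,
      tool_choice: { type: "tool", name: options.toolName },
    };
  }
  const budget = thinking === true ? DEFAULT_THINKING_BUDGET : thinking.budgetTokens;
  return {
    temperature: 1, // must be 1 when thinking is enabled
    thinking: { type: "enabled", budget_tokens: budget },
    tool_choice: { type: "auto" }, // forced tool choice is incompatible with thinking
  };
}
```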
types.ts
Lab validation feature TypeScript type definitions             

apps/web/src/app/monitor/lab/types.ts

  • New file defining TypeScript types for the Lab (Validation) feature
  • Defines baseline, corpus document, validation run, and snapshot types
    for validation framework
  • Includes comparison data types for tracking matched, new, and lost
    comments
  • Defines filter configuration types for principle-of-charity,
    supported-elsewhere, severity, and confidence filters
  • Includes extractor and judge configuration types with reasoning and
    provider preferences
  • Defines profile configuration structure with models, thresholds,
    prompts, and filter chain
  • Includes API parameter and response metrics types for telemetry
    tracking
+340/-0 
index.ts
Validation framework barrel export                                             

meta-evals/src/validation/index.ts

  • New file serving as barrel export for validation framework
  • Exports types and comparison functions from validation module
+8/-0     
fuzzy-dedup.ts
Fuzzy deduplication strategies for extraction issues         

meta-evals/src/components/extractor-lab/fuzzy-dedup.ts

  • Implements four fuzzy deduplication strategies (exact, Jaccard,
    Fuse.js, uFuzzy) for comparing extraction issues
  • Provides similarity calculation functions and quality scoring based on
    text length and severity/confidence/importance metrics
  • Includes deduplication logic that keeps higher-quality issues when
    duplicates are found
  • Exports multi-strategy runner and helper functions for flattening
    extractor results
+323/-0 
usageMetrics.ts
Unified usage metrics across API providers                             

internal-packages/ai/src/utils/usageMetrics.ts

  • Defines unified usage metrics interface supporting both OpenRouter and
    Anthropic APIs
  • Implements Anthropic pricing table with model-specific rates for
    input/output/cache tokens
  • Provides conversion functions to normalize usage data from different
    providers into consistent format
  • Includes cost calculation and aggregation utilities for multi-provider
    usage tracking
+261/-0 
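
Normalizing usage from the two providers into one shape and pricing it might look like the sketch below. The `UnifiedUsage` field names and the helper names are hypothetical, and rates are supplied by the caller rather than hard-coded because the PR's pricing table is not reproduced here. The provider-side field names are the ones those APIs return.

```typescript
// Unified shape; these field names are hypothetical.
interface UnifiedUsage {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
}

// Subset of the Anthropic Messages API usage object.
interface AnthropicUsage {
  input_tokens: number;
  output_tokens: number;
  cache_read_input_tokens?: number | null;
}

// Subset of the OpenAI-compatible usage object returned by OpenRouter.
interface OpenRouterUsage {
  prompt_tokens: number;
  completion_tokens: number;
}

function fromAnthropic(u: AnthropicUsage): UnifiedUsage {
  return {
    inputTokens: u.input_tokens,
    outputTokens: u.output_tokens,
    cacheReadTokens: u.cache_read_input_tokens ?? 0,
  };
}

function fromOpenRouter(u: OpenRouterUsage): UnifiedUsage {
  return { inputTokens: u.prompt_tokens, outputTokens: u.completion_tokens, cacheReadTokens: 0 };
}

// Rates are in USD per million tokens and come from the caller.
interface Rates {
  inputPerMillion: number;
  outputPerMillion: number;
}

function estimateCost(u: UnifiedUsage, rates: Rates): number {
  return (u.inputTokens * rates.inputPerMillion + u.outputTokens * rates.outputPerMillion) / 1e6;
}
```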
index.ts
Principle of charity filter tool implementation                   

internal-packages/ai/src/tools/principle-of-charity-filter/index.ts

  • Implements principle of charity filter tool that evaluates issues
    under charitable interpretation
  • Uses LLM to determine if flagged issues remain valid when author's
    argument is interpreted charitably
  • Separates issues into valid and dissolved categories with detailed
    reasoning
  • Includes context extraction and document truncation for efficient LLM
    processing
+326/-0 
config.ts
Multi-extractor configuration parser and utilities             

internal-packages/ai/src/analysis-plugins/plugins/fallacy-check/extraction/config.ts

  • Parses multi-extractor configuration from FALLACY_EXTRACTORS and
    FALLACY_JUDGE environment variables
  • Generates unique extractor labels and IDs based on model and
    configuration parameters
  • Provides temperature defaults for Claude vs OpenRouter models
  • Supports profile-based configuration loading from database
+319/-0 
index.ts
Supported elsewhere filter tool implementation                     

internal-packages/ai/src/tools/supported-elsewhere-filter/index.ts

  • Implements filter tool to check if flagged issues are
    supported/explained elsewhere in document
  • Uses LLM to search document for supporting evidence and determine if
    issues should be filtered
  • Separates results into supported and unsupported issues with location
    tracking
  • Includes evidence keyword detection and document context truncation
+295/-0 
types.ts
Pipeline telemetry types for observability                             

internal-packages/ai/src/analysis-plugins/plugins/fallacy-check/telemetry/types.ts

  • Defines comprehensive telemetry types for pipeline execution tracking
    and observability
  • Includes stage metrics, filtered/passed item records, and extraction
    phase telemetry
  • Tracks per-extractor metrics, judge decisions, and profile
    configuration information
  • Provides pipeline stage constants and complete execution record
    structure
+342/-0 
profile-types.ts
Fallacy checker profile configuration types                           

internal-packages/ai/src/analysis-plugins/plugins/fallacy-check/profile-types.ts

  • Defines fallacy checker profile configuration types for database
    storage
  • Includes model, threshold, prompt, and filter chain configuration
    structures
  • Provides filter type definitions and migration utilities for backwards
    compatibility
  • Exports default configurations and profile creation helpers
+317/-0 
route.ts
Validation run finalization API endpoint                                 

apps/web/src/app/api/monitor/lab/runs/[id]/finalize/route.ts

  • Implements API endpoint to finalize validation runs by comparing
    baseline and new evaluation snapshots
  • Performs comment matching using Jaccard similarity and tracks
    changed/unchanged documents
  • Saves comparison results including pipeline telemetry and stage
    metrics to database
  • Handles error cases and updates run status appropriately
+278/-0 
PipelineTelemetry.ts
Pipeline telemetry collector with fluent API                         

internal-packages/ai/src/analysis-plugins/plugins/fallacy-check/telemetry/PipelineTelemetry.ts

  • Implements fluent API for collecting and aggregating pipeline
    execution metrics
  • Tracks stage timing, input/output counts, costs, and API parameters
  • Records filtered and passed items with reasoning for debugging
  • Provides finalization method to generate complete execution records
    with version tracking
+300/-0 
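
A fluent telemetry collector of the kind described above returns `this` from each recording method so that calls can be chained. The class below is a stripped-down, hypothetical sketch: it tracks only counts and timing, takes the clock as an argument for testability, and does not attempt to mirror the real `PipelineTelemetry` API.

```typescript
interface StageRecord {
  stage: string;
  inputCount: number;
  outputCount: number;
  durationMs: number;
}

// Hypothetical collector; method names are illustrative, not the PR's API.
class TelemetrySketch {
  private readonly stages: StageRecord[] = [];
  private open: { stage: string; inputCount: number; startedAt: number } | null = null;

  startStage(stage: string, inputCount: number, nowMs: number): this {
    this.open = { stage, inputCount, startedAt: nowMs };
    return this;
  }

  // Closes the currently open stage; a call with no open stage is ignored.
  endStage(outputCount: number, nowMs: number): this {
    if (this.open !== null) {
      this.stages.push({
        stage: this.open.stage,
        inputCount: this.open.inputCount,
        outputCount,
        durationMs: nowMs - this.open.startedAt,
      });
      this.open = null;
    }
    return this;
  }

  finalize(): { stages: StageRecord[]; totalDurationMs: number } {
    let total = 0;
    for (const s of this.stages) total += s.durationMs;
    return { stages: this.stages.slice(), totalDurationMs: total };
  }
}
```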
openrouter-types.ts
OpenRouter API types and utilities                                             

internal-packages/ai/src/utils/openrouter-types.ts

  • Defines client-safe type definitions for OpenRouter API integration
  • Includes request/response types, tool definitions, and reasoning
    configuration
  • Provides constants for common OpenRouter models and temperature ranges
    by provider
  • Exports utility functions for provider detection and temperature
    normalization
+257/-0 
PluginManager.ts
Plugin manager profile configuration and telemetry             

internal-packages/ai/src/analysis-plugins/PluginManager.ts

  • Adds profile configuration support for FallacyCheckPlugin with
    fallacyCheckProfileId and fallacyCheckAgentId options
  • Collects and returns pipelineTelemetry from plugins in analysis
    results
  • Improves error handling with better type checking for error messages
  • Removes unnecessary async/await and fixes variable naming issues
+47/-28 
allModels.ts
Model discovery and information utilities                               

internal-packages/ai/src/utils/allModels.ts

  • Fetches available models from both Anthropic and OpenRouter APIs with
    caching
  • Provides model information including context length, temperature
    support, and reasoning capabilities
  • Implements filtering and grouping utilities for model discovery
  • Includes temperature presets for model configuration UI
+183/-0 
Miscellaneous
1 file
lab-exports.ts
Standalone lab exports avoiding circular dependencies       

internal-packages/ai/src/analysis-plugins/plugins/fallacy-check/extraction/lab-exports.ts

  • Provides standalone type definitions and config parsing for Extractor
    Lab without circular dependencies
  • Duplicates configuration types and parsing logic to avoid import
    cycles with plugin system
  • Includes extractor result types and multi-extractor configuration
    structures
  • Exports label generation and ID generation utilities for telemetry
    correlation
+315/-0 
Additional files
101 files
CLAUDE.md +75/-0   
route.ts +28/-0   
route.ts +96/-0   
route.ts +33/-0   
route.ts +38/-0   
route.ts +56/-0   
route.ts +135/-0 
route.ts +59/-0   
route.ts +33/-0   
route.ts +26/-0   
route.ts +151/-0 
route.ts +135/-0 
route.ts +22/-0   
route.ts +52/-0   
route.ts +59/-0   
route.ts +116/-0 
route.ts +101/-0 
types.ts +10/-0   
client-layout.tsx +6/-0     
BaselineCard.tsx +49/-0   
BaselineList.tsx +27/-0   
CreateBaselineModal.tsx +359/-0 
AllEvaluationsList.tsx +200/-0 
RunDetail.tsx +131/-0 
ExtractorEditor.tsx +80/-0   
FilterChainEditor.tsx +491/-0 
JudgeEditor.tsx +56/-0   
ModelConfigurator.tsx +403/-0 
ModelSelector.tsx +160/-0 
ProfileDetailView.tsx +606/-0 
ProfilesList.tsx +129/-0 
ProviderSelector.tsx +214/-0 
ExtractorCards.tsx +363/-0 
ItemCards.tsx +136/-0 
PipelineView.tsx +434/-0 
SnapshotComparison.tsx +269/-0 
pipelineUtils.ts +109/-0 
BaselinesTab.tsx +105/-0 
HistoryTab.tsx +303/-0 
RunTab.tsx +312/-0 
useAllEvaluations.ts +68/-0   
useBaselines.ts +63/-0   
useCorpusDocs.ts +40/-0   
useDefaultPrompts.ts +37/-0   
useModelEndpoints.ts +107/-0 
useModels.ts +86/-0   
useProfiles.ts +110/-0 
useRuns.ts +73/-0   
page.tsx +534/-0 
formatters.ts +54/-0   
createToolAPIHandler.ts +1/-2     
dev-env.sh +69/-13 
setup_db.sh +3/-0     
lint-pr-strict.sh +242/-0 
package.json +29/-0   
markdown.ts +0/-16   
dedup.ts +182/-0 
index.ts +9/-0     
types.ts +325/-0 
index.ts +23/-0   
types.ts +1/-0     
index.ts +62/-16 
server.ts +15/-0   
Tool.ts +9/-23   
testRunner.ts +3/-2     
types.ts +41/-0   
index.ts +25/-43 
client-types.ts +223/-0 
configs.ts +1/-1     
prompts.ts +117/-0 
types.ts +82/-4   
config.ts +12/-0   
prompts.ts +33/-0   
types.ts +178/-0 
index.ts +13/-1   
types.ts +8/-0     
generated-schemas.ts +51/-7   
config.ts +13/-0   
prompts.ts +64/-0   
types.ts +91/-0   
index.ts +8/-8     
prompts.ts +53/-0   
types.ts +91/-0   
common.ts +148/-0 
index.ts +7/-0     
modelConfigResolver.ts +253/-0 
analyzeDocument.ts +43/-23 
types.ts +17/-1   
index.ts +11/-12 
index.ts +1/-1     
migration.sql +2/-0     
migration.sql +48/-0   
migration.sql +61/-0   
migration.sql +22/-0   
migration.sql +2/-0     
schema.prisma +120/-19
.eslintrc.json +6/-0     
process-pgboss-worker.ts +16/-8   
JobService.ts +6/-1     
jobTypes.ts +2/-0     
Additional files not shown

michaelr524 and others added 5 commits January 3, 2026 19:59
Based on user feedback from LessWrong/EA Forum about false positives,
aggressive flagging, and missing context issues.

Key changes planned:
- Single-pass full document extraction (replaces chunking)
- Multi-stage filtering (charity, supported elsewhere, dedup)
- Simplified review (summarization only)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Was backwards: "defending weak claim by switching to strong one"
Now correct: "defending controversial claim by retreating to defensible one"

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add DB-level title search with case-insensitive LIKE query
- Increase document limit from 30 to 100
- Add debounced search input with spinner
- Fix 'q' key quit issue when typing in search field
- Improve date format to human-readable (Dec 27, 2025)
- Fix alignment with fixed-width title padding

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add deleteSeries() to MetaEvaluationRepository
- Add delete confirmation modal in MainMenu (d key, y/n confirm)
- Improve API error handling with human-readable messages
- Switch dev-env.sh from zellij to tmux

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Plugin now passes full documentText for analysis instead of splitting into chunks
- Extractor uses documentText when text param is not provided (single-pass mode)
- Made text param optional in FallacyExtractorInput to support both modes
- Backwards compatible: chunk mode still works when text+chunkStartOffset provided

This reduces code complexity and provides better context to the LLM
by analyzing the full document at once.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
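
The two extraction modes in the commit above can be illustrated with a small input resolver. The field names `text`, `documentText`, and `chunkStartOffset` come from the commit message; the `ResolvedInput` shape and `resolveExtractionInput` are hypothetical, as is defaulting a missing offset to 0.

```typescript
// Field names follow the commit message; the resolver itself is hypothetical.
interface FallacyExtractorInput {
  documentText: string;
  text?: string;
  chunkStartOffset?: number;
}

interface ResolvedInput {
  mode: "single-pass" | "chunk";
  textToAnalyze: string;
  offset: number;
}

// Without text, analyze the whole document; with text, stay in chunk mode.
function resolveExtractionInput(input: FallacyExtractorInput): ResolvedInput {
  if (input.text === undefined) {
    return { mode: "single-pass", textToAnalyze: input.documentText, offset: 0 };
  }
  return {
    mode: "chunk",
    textToAnalyze: input.text,
    offset: input.chunkStartOffset ?? 0,
  };
}
```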
@vercel

vercel bot commented Jan 3, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Review Updated (UTC)
roast-my-post Ready Ready Preview Jan 23, 2026 1:52pm

Request Review

@coderabbitai
Contributor

coderabbitai bot commented Jan 3, 2026

Important

Review skipped

Too many files!

12 files out of 162 files are above the max files limit of 150.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.



- Add SupportedElsewhereFilterTool that checks if flagged issues are
  actually supported/justified elsewhere in the document
- Integrate filter into fallacy-check plugin between extraction and
  comment generation phases
- Add debug logging to fallacy extractor and filter for visibility
- Add restart command to dev-env.sh with buffer clearing
- Update implementation notes with next steps (model testing,
  per-claim verification, extraction prompt improvements)

Results on test document show filter correctly identifies claims that
are justified by technical explanations later in the document. Opus
filters more aggressively (0 issues) vs Sonnet (1-2 issues).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add callOpenRouterWithTool() wrapper for OpenRouter API tool calling
- Add Gemini 3 Pro/Flash model IDs to OPENROUTER_MODELS
- Add temperature normalization per provider (Anthropic 0-1, others 0-2)
- Update supported-elsewhere filter to use OpenRouter for non-Claude models
- Add FALLACY_FILTER_MODEL env var for easy model switching
- Increase max_tokens to 8000 for OpenRouter (Gemini Pro needs more)
- Add error logging for tool call failures

Tested with Gemini 3 Flash ($0.003) and Pro ($0.054) - both agree
with Opus that all 5 issues are supported elsewhere (vs Sonnet keeping 1-2).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
… restart

- Add model parameter to FallacyExtractorInput for OpenRouter models
- Support FALLACY_EXTRACTOR_MODEL env var for easy model switching
- Use callOpenRouterWithTool for non-Claude models (Gemini, GPT, etc.)
- Clear visible screen before scrollback in dev-env restart

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
michaelr524 and others added 3 commits January 7, 2026 11:08
- Update model testing results (Opus, Sonnet, Gemini Flash/Pro comparison)
- Document OpenRouter integration for multi-model testing
- Reorganize next steps by pipeline stage (extraction, filtering, review)
- Add planned filters: Principle of Charity, dedup/severity threshold
- Add cross-cutting concerns: multi-expert aggregation, observability, validation
- Add section 3.8: Prioritized implementation plan with 4 phases
- Include risk table with mitigations

Key insight: Phase 1 (observability + validation) must come first -
can't improve what you can't measure.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Meta-eval scoring for comment quality (accuracy, clarity, tone)
- Review stage improvements based on meta-eval feedback
- Feedback loop to iterate on prompts over time

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Create telemetry module with StageMetrics, PipelineExecutionRecord types
- Add PipelineTelemetry collector class with fluent API
- Track 5 pipeline stages: extraction, dedup, filter, comment-gen, review
- Persist telemetry to EvaluationVersion.pipelineTelemetry JSON field
- Refactor FallacyCheckPlugin with helper methods for cleaner code

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add validation types (EvaluationSnapshot, DocumentComparisonResult, RegressionFlag)
- Add comment comparison logic with fuzzy matching (Levenshtein similarity)
- Add regression detection: score drop, lost comments, high-importance loss, extraction drop
- Add Validation screen to meta-evals CLI with Corpus/Compare/Results tabs
- Add repository methods for corpus queries and evaluation snapshots
- Clarify Settings UI shows judge model is for Score/Rank flows

TODO: Add baseline selection (pinned golden baseline vs latest run)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add ValidationBaseline and ValidationBaselineSnapshot tables
- Add repository methods for baseline CRUD
- Update Validation UI with baseline management:
  - Create/delete/select baselines
  - Run pipeline on baseline documents
  - Compare new results vs saved baseline
  - Save results as new baseline
- Show change summary: "X kept, +Y new, -Z lost" per document
- Use [=] unchanged / [~] changed instead of pass/fail icons
- Clarify main menu labels (Score/Rank vs Validation)
- Remove emoji from menu items

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
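
The per-document change summary and status markers from the commit above can be produced by two small formatters. The output strings follow the commit message; the `ChangeCounts` shape and function names are hypothetical.

```typescript
// Hypothetical counts for one document's comparison against a baseline.
interface ChangeCounts {
  kept: number;
  added: number;
  lost: number;
}

function formatChangeSummary(c: ChangeCounts): string {
  return `${c.kept} kept, +${c.added} new, -${c.lost} lost`;
}

// [=] when nothing was added or lost, [~] otherwise.
function statusMarker(c: ChangeCounts): string {
  return c.added === 0 && c.lost === 0 ? "[=]" : "[~]";
}
```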
michaelr524 and others added 2 commits January 7, 2026 12:55
- MainMenu now only has 4 options: Score/Rank, Validation, Settings, Exit
- Created ScoreRankMenu component with series list, create, delete
- Settings remains as modal overlay in MainMenu
- Updated App.tsx routing for new screen structure
- Navigation: SeriesDetail and CreateBaseline now return to ScoreRankMenu

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add ValidationRun and ValidationRunSnapshot tables for persisting runs
- Capture per-item filter reasoning in pipeline telemetry (filteredItems)
- Record filter reasons from supported-elsewhere-filter and review stages
- Display filter reasoning for lost comments in validation UI
- Distinguish filtered comments (⊘) from not-extracted comments (−)
- Simplify UI: remove Results tab, auto-navigate to History after run
- Show all comments in scrollable list (no more "and X more" truncation)
- Add legend and summary breakdown (X filtered, Y not extracted)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
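
The distinction drawn in the commit above between filtered comments (⊘) and not-extracted comments (−) might look roughly like the following. The record shape and the exact-text lookup are assumptions made for illustration; the telemetry format in the PR is not shown in this thread.

```typescript
// Hypothetical record shape for one entry in the telemetry's filteredItems.
interface FilteredItemRecord {
  quotedText: string;
  reason: string;
}

type LostReason =
  | { kind: "filtered"; reason: string }
  | { kind: "not-extracted" };

// A lost comment is "filtered" if a filter stage recorded it, and
// "not-extracted" otherwise. Exact text matching is an assumption.
function classifyLostComment(
  quotedText: string,
  filteredItems: FilteredItemRecord[]
): LostReason {
  const hit = filteredItems.find((item) => item.quotedText === quotedText);
  return hit ? { kind: "filtered", reason: hit.reason } : { kind: "not-extracted" };
}

function lostMarker(reason: LostReason): string {
  return reason.kind === "filtered" ? "⊘" : "−";
}
```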
…uter direct API

Multi-extractor system:
- Run multiple extractors in parallel with different models/settings
- Optional LLM judge for aggregation (disabled by default, uses simple dedup)
- Per-extractor configuration via FALLACY_EXTRACTORS env var

New extractor config options:
- `thinking: boolean` - Enable/disable extended thinking (Claude) or reasoning (OpenRouter)
- `temperature: number | "default"` - Explicit temp or use model's native default

OpenRouter direct API:
- Replaced OpenAI SDK with direct HTTP calls for full parameter control
- Proper `reasoning_effort` support: none/minimal/low/medium/high/xhigh
- New `callOpenRouterChat()` for non-tool-calling use cases
- Updated claim-evaluator to use new API

Telemetry & UI:
- Track temperatureConfig and thinkingEnabled per extractor
- Display extraction params in validation UI

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
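
The `temperature: number | "default"` option described in the commit above suggests that "default" means "send no temperature and let the model decide". A minimal sketch of that resolution step, with hypothetical names and a caller-supplied fallback, could be:

```typescript
// Hypothetical helper: a sketch of how a temperature setting could be
// turned into request parameters. Not the PR's actual code.
type TemperatureConfig = number | "default";

interface SamplingParams {
  temperature?: number;
}

function resolveTemperature(
  config: TemperatureConfig | undefined,
  fallback: number
): SamplingParams {
  // "default": omit the field so the provider applies the model's native default.
  if (config === "default") return {};
  // Not configured: use the caller's fallback for this provider.
  if (config === undefined) return { temperature: fallback };
  // Explicit number: pass it through unchanged.
  return { temperature: config };
}
```

Omitting the key entirely, rather than sending `undefined` or `null`, avoids relying on how a given provider treats empty values.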
- Add new Extractor Lab screen to main menu
- Allows running fallacy extraction directly without full pipeline
- Configure multiple extractors with different models/temperatures
- Uses same validation corpus as Validation screen (50 docs)
- Display format matches Create Baseline (numbered, with dates)
- Export @roast/ai/fallacy-extraction module for external use

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update package.json export to use dist files instead of src
- Use static import instead of dynamic import in ExtractorLab
- Fixes ERR_REQUIRE_CYCLE_MODULE error when running extraction

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
michaelr524 and others added 2 commits January 22, 2026 10:29
- Add new "All Evals" tab to Lab UI showing recent user-facing evaluations
  with their pipeline telemetry (not just validation runs)
- Add API endpoint /api/monitor/lab/evaluations to fetch evaluation versions
  with pipelineTelemetry data
- Track items that pass through filters (not just filtered out items):
  - Add PassedItemRecord type to telemetry
  - Record passed items in principle-of-charity and supported-elsewhere filters
  - Display passed items in PipelineView (collapsed by default)
- New components: AllEvaluationsList, PassedItemCard, useAllEvaluations hook

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add whitespace-nowrap to prevent text wrapping
- Reduce padding and gap for better fit
- Shorten 'All Evals' to 'Evals'
- Add flex-shrink-0 to icons

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove unused imports: getGlobalSessionManager, ToolChainResult, LIMITS,
  getMultiExtractorConfig, DEFAULT_THRESHOLDS, DEFAULT_FILTER_CHAIN
- Remove unused helper functions: escapeMd, sanitizeUrl
- Remove unused type import: ReasoningEffort (keep re-export for compat)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Dead code cleanup - method was defined but never called.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
michaelr524 and others added 5 commits January 23, 2026 12:00
- Remove unnecessary optional chains and nullish coalescing
- Remove unused imports and variables
- Fix async functions without await
- Remove redundant type assertions and conditions
- Add dev/scripts/lint-pr-strict.sh for PR-scoped strict linting

Reduces strict lint warnings from 108 to 18 in the ai package.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace || with ?? for nullish coalescing on optional array access
- Remove unnecessary defensive check that TypeScript guarantees

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
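
For readers unfamiliar with why the commit above swaps `||` for `??`: the two operators differ on falsy-but-defined values such as `0` or `""`. A small illustration with made-up function names:

```typescript
// `||` falls back on any falsy value (0, "", false, null, undefined).
function pickWithOr(value: number | undefined, fallback: number): number {
  return value || fallback;
}

// `??` falls back only on null or undefined.
function pickWithNullish(value: number | undefined, fallback: number): number {
  return value ?? fallback;
}
```

With `value = 0`, `pickWithOr` returns the fallback while `pickWithNullish` returns `0`, which is usually the intended behavior for optional numeric or string fields.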
- Remove unused formatTimeout import
- Replace job: any with proper inline type
- Add void prefix to async signal handlers
- Remove async from sync setupSessionTracking function
- Remove unnecessary defensive checks (TypeScript guarantees values)
- Fix nullish coalescing (|| to ??) for submittedBy?.id
- Remove unnecessary optional chain on agentVersion
- Add .eslintrc.json and tsconfig.lint.json for type-aware linting

Remaining 3 warnings are `any` types that would require exporting
types from @roast/ai - acceptable tradeoff for now.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@roast/ai:
- Add DocumentAnalysisResult interface to shared/types.ts
- Export DocumentAnalysisResult from workflows/index.ts and server.ts
- Export PluginType from index.ts (was commented out)
- Use named type in analyzeDocument and analyzeDocumentUnified

@roast/jobs:
- Import DocumentAnalysisResult and Comment from @roast/ai
- Replace all `any` types with proper types in JobOrchestrator
- Remove unnecessary defensive checks revealed by proper typing
- Fix nullish coalescing (|| null to ?? undefined) for Prisma

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix floating promises in hooks with void operator
- Remove unused imports and variables
- Fix unnecessary type assertions and optional chains
- Add exhaustive switch cases in tab components
- Fix react-hooks/exhaustive-deps warnings

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
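
The "exhaustive switch cases" item in the commit above is commonly implemented with a `never` check, so that adding a new tab without handling it becomes a compile-time error. The tab names below are placeholders, not the ones used in the Lab UI.

```typescript
// Placeholder tab names, for illustration only.
type LabTab = "runs" | "baselines" | "profiles";

// Called from the default branch; compiles only if every case is handled.
function assertNever(value: never): never {
  throw new Error(`Unhandled case: ${String(value)}`);
}

function tabLabel(tab: LabTab): string {
  switch (tab) {
    case "runs":
      return "Runs";
    case "baselines":
      return "Baselines";
    case "profiles":
      return "Profiles";
    default:
      // If LabTab gains a member, `tab` is no longer `never` here
      // and the compiler reports an error.
      return assertNever(tab);
  }
}
```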
Create tools/client-types.ts with type definitions extracted from tool
implementations to avoid pulling in server dependencies when importing
types for UI components.

- Add DocumentChunkerOutput, TextLocationFinderOutput, CheckMathOutput,
  CheckSpellingGrammarOutput, ExtractFactualClaimsOutput, and related types
- Export all client-safe types from @roast/ai index
- Fix Tool import in createToolAPIHandler.ts to use @roast/ai/server

This fixes CI failures where web app typecheck couldn't find tool types
that were commented out due to server dependency issues.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
michaelr524 and others added 2 commits January 23, 2026 12:56
Document the proper verification workflow when making changes to
internal packages vs web app only. Key insight: turbo typecheck
rebuilds packages first (like CI), while per-package typecheck
uses potentially stale dist/ folders.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Extract inline type annotations into properly named interfaces across
web app, @roast/ai, and @roast/jobs packages.

Web app:
- Add shared RouteIdParams for Next.js 15 dynamic route params
- Add prop interfaces for 8 UI components
- Add RunProgress interface for useState in page.tsx

@roast/ai:
- Add DuplicateMatch<T> generic for dedup matching
- Add ResolvedReasoning, DeduplicationResult interfaces
- Add ExtractorCallResult, JudgeCallResult type aliases

@roast/jobs:
- Add JobWithAgentVersions interface

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
michaelr524 and others added 3 commits January 23, 2026 13:13
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove obvious/stale comments that don't add value
- Replace console.log with context.logger.debug in fallacy-extractor
- Simplify error handling in fallacy-judge config parsing

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- config.ts: Replace 6 console.warn() calls with logger.warn()
- openrouter.ts: Replace console.warn/error with logger methods
- PluginManager.ts: Remove stale comments and debugging notes

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@qodo-code-review

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
Sensitive data exposure

Description: The extractor logs user-supplied content (e.g., textPreview via textToAnalyze.substring(0,
100) and other document metadata), which can expose sensitive/PII data in logs and any
downstream log aggregation; additionally, similar telemetry/logging patterns in
internal-packages/ai/src/analysis-plugins/plugins/fallacy-check/index.ts record and
persist quoted text/reasoning, so ensure log/telemetry sinks are treated as sensitive and
are redacted/access-controlled.
index.ts [142-165]

Referred Code
// Debug logging for development
context.logger.debug(
  `[FallacyExtractor] Running: model=${modelId || "default"} (${isOpenRouterModel ? "OpenRouter" : "Claude"}), mode=${input.text ? "chunk" : "single-pass"}, docLength=${textToAnalyze.length}`
);

// Audit log: Tool execution started
context.logger.info(
  "[FallacyExtractor] AUDIT: Tool execution started",
  {
    timestamp: new Date().toISOString(),
    promptVersion: PROMPT_VERSION,
    textLength: textToAnalyze.length,
    textPreview: textToAnalyze.substring(0, 100),
    minSeverityThreshold: MIN_SEVERITY_THRESHOLD,
    maxIssues: MAX_ISSUES,
    hasDocumentText: !!input.documentText,
    hasChunkOffset: input.chunkStartOffset !== undefined,
    mode: input.text ? "chunk" : "single-pass",
  }
);



 ... (clipped 3 lines)
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

🔴
Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
Missing actor context: Admin profile update/delete actions are logged without including the authenticated userId,
making it difficult to reconstruct who performed the change.

Referred Code
    logger.info("Profile updated", { profileId: id });

    return NextResponse.json({ profile });
  } catch (error) {
    logger.error("Error updating profile:", error);
    return commonErrors.serverError("Failed to update profile");
  }
}

/**
 * DELETE /api/monitor/lab/profiles/[id]
 * Delete a profile
 */
export async function DELETE(
  request: NextRequest,
  { params }: RouteIdParams
) {
  const userId = await authenticateRequest(request);
  if (!userId) return commonErrors.unauthorized();

  const adminCheck = await isAdmin();


 ... (clipped 19 lines)


Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status:
Leaky error details: OpenRouter errors include full/raw response payload text in the thrown Error message,
which can expose internal provider details if surfaced outside secure logs.

Referred Code
    // Include full error body for debugging (especially useful for 429 rate limits)
    errorDetails = ` | Full response: ${errorText}`;
  } catch {
    // If not JSON, include raw text
    if (errorText) {
      errorDetails = ` | Response: ${errorText.substring(0, 500)}`;
    }
  }

  throw new Error(`OpenRouter API error (${response.status}): ${errorMessage}${errorDetails}`);
}


Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status:
Logs user content: The extractor writes textPreview (a substring of analyzed document text) to INFO-level
audit logs, which can leak user-provided content and potential PII into logs.

Referred Code
// Audit log: Tool execution started
context.logger.info(
  "[FallacyExtractor] AUDIT: Tool execution started",
  {
    timestamp: new Date().toISOString(),
    promptVersion: PROMPT_VERSION,
    textLength: textToAnalyze.length,
    textPreview: textToAnalyze.substring(0, 100),
    minSeverityThreshold: MIN_SEVERITY_THRESHOLD,
    maxIssues: MAX_ISSUES,
    hasDocumentText: !!input.documentText,
    hasChunkOffset: input.chunkStartOffset !== undefined,
    mode: input.text ? "chunk" : "single-pass",
  }
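
One possible way to address the finding above is to log only non-identifying metadata by default and make the content preview opt-in. The sketch below uses hypothetical helper names and is not taken from the PR.

```typescript
// Hypothetical helpers: a sketch of logging document metadata without
// writing document content to the log by default.
interface TextLogSummary {
  textLength: number;
  lineCount: number;
}

function summarizeTextForLog(text: string): TextLogSummary {
  return {
    textLength: text.length,
    lineCount: text === "" ? 0 : text.split("\n").length,
  };
}

// The preview is included only when explicitly requested,
// for example in a local debugging session.
function buildAuditFields(
  text: string,
  includePreview: boolean
): Record<string, unknown> {
  const fields: Record<string, unknown> = { ...summarizeTextForLog(text) };
  if (includePreview) {
    fields["textPreview"] = text.substring(0, 100);
  }
  return fields;
}
```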


Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Missing body validation: The PUT handler accepts arbitrary JSON (name, description, config, isDefault) without
schema validation/sanitization before persisting to the database.

Referred Code
const body = await request.json();
const { name, description, config, isDefault } = body;

// Check profile exists
const existing = await prisma.fallacyCheckerProfile.findUnique({
  where: { id },
});

if (!existing) {
  return NextResponse.json({ error: "Profile not found" }, { status: 404 });
}

// Check for duplicate name (excluding current profile)
if (name && name !== existing.name) {
  const duplicate = await prisma.fallacyCheckerProfile.findFirst({
    where: {
      agentId: existing.agentId,
      name,
      id: { not: id },
    },
  });


 ... (clipped 26 lines)
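
As a sketch of the body validation this finding asks for, the function below checks each field before anything reaches the database. It is deliberately dependency-free; the codebase could equally use a Zod schema. The field limit is an assumption, not a value from the PR.

```typescript
// Dependency-free sketch; MAX_NAME_LENGTH is an assumed limit.
interface ProfileUpdate {
  name?: string;
  description?: string;
  config?: Record<string, unknown>;
  isDefault?: boolean;
}

type ParseResult =
  | { ok: true; value: ProfileUpdate }
  | { ok: false; error: string };

const MAX_NAME_LENGTH = 100;

function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function parseProfileUpdate(body: unknown): ParseResult {
  if (!isPlainObject(body)) {
    return { ok: false, error: "Body must be a JSON object" };
  }
  const value: ProfileUpdate = {};

  const name = body["name"];
  if (name !== undefined) {
    if (typeof name !== "string") return { ok: false, error: "name must be a string" };
    const trimmed = name.trim();
    if (trimmed.length === 0 || trimmed.length > MAX_NAME_LENGTH) {
      return { ok: false, error: "name has an invalid length" };
    }
    value.name = trimmed;
  }

  const description = body["description"];
  if (description !== undefined) {
    if (typeof description !== "string") {
      return { ok: false, error: "description must be a string" };
    }
    value.description = description;
  }

  const config = body["config"];
  if (config !== undefined) {
    if (!isPlainObject(config)) return { ok: false, error: "config must be an object" };
    value.config = config;
  }

  const isDefault = body["isDefault"];
  if (isDefault !== undefined) {
    if (typeof isDefault !== "boolean") {
      return { ok: false, error: "isDefault must be a boolean" };
    }
    value.isDefault = isDefault;
  }

  return { ok: true, value };
}
```

A handler would call `parseProfileUpdate(await request.json())` and return a 400 response when `ok` is false. Unknown keys are dropped rather than persisted.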


Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Error propagation risk: The OpenRouter client throws errors that can include raw upstream response bodies, and it
is unclear from the diff whether these errors are always confined to internal logs versus
potentially being returned to end users.

Referred Code
if (!response.ok) {
  const errorText = await response.text().catch(() => '');
  let errorMessage = response.statusText;
  let errorDetails = '';

  try {
    const errorBody = JSON.parse(errorText) as OpenRouterError;
    errorMessage = errorBody.error.message || response.statusText;
    // Include full error body for debugging (especially useful for 429 rate limits)
    errorDetails = ` | Full response: ${errorText}`;
  } catch {
    // If not JSON, include raw text
    if (errorText) {
      errorDetails = ` | Response: ${errorText.substring(0, 500)}`;
    }
  }

  throw new Error(`OpenRouter API error (${response.status}): ${errorMessage}${errorDetails}`);
}
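
One way to keep the raw upstream body out of the thrown message, while still making it available for internal logging, is to carry it on a separate error field. This is a sketch with hypothetical names, not the PR's OpenRouter client code.

```typescript
// Hypothetical error type: the message is safe to surface, while `detail`
// holds a truncated raw body intended for internal logs only.
class UpstreamApiError extends Error {
  readonly status: number;
  readonly detail: string;

  constructor(status: number, publicMessage: string, detail: string) {
    super(`OpenRouter API error (${status}): ${publicMessage}`);
    this.name = "UpstreamApiError";
    this.status = status;
    this.detail = detail;
  }
}

function buildUpstreamError(
  status: number,
  statusText: string,
  errorText: string,
  maxDetailLength = 500
): UpstreamApiError {
  let message = statusText;
  try {
    const parsed = JSON.parse(errorText) as { error?: { message?: unknown } } | null;
    const candidate = parsed?.error?.message;
    if (typeof candidate === "string" && candidate.length > 0) {
      message = candidate;
    }
  } catch {
    // Body was not JSON; keep the status text as the message.
  }
  return new UpstreamApiError(status, message, errorText.substring(0, maxDetailLength));
}
```

Callers that log the error can include `detail` explicitly; anything that serializes only `message` no longer leaks the full response. Whether the provider's own `error.message` is safe to surface is a separate judgment call.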


Compliance status legend
🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@qodo-code-review

PR Code Suggestions ✨

Explore these optional code suggestions:

Category | Suggestion | Impact
General
Isolate comment errors safely

Wrap the buildFallacyComment call within a try/catch block to handle potential
errors for individual comments, preventing a single failure from halting the
entire batch.

internal-packages/ai/src/analysis-plugins/plugins/fallacy-check/index.ts [977-994]

 private async generateCommentsForIssues(
   issues: FallacyIssue[],
   documentText: string
 ): Promise<Comment[]> {
   const commentPromises = issues.map(async (issue) => {
-    // Run in next tick to ensure true parallelism
-    await new Promise((resolve) => setImmediate(resolve));
-    const comment = await buildFallacyComment(issue, documentText, { logger });
-    // Filter out comments with empty descriptions
-    if (comment?.description.trim()) {
-      return comment;
+    await new Promise(resolve => setImmediate(resolve));
+    try {
+      const comment = await buildFallacyComment(issue, documentText, { logger });
+      if (comment?.description.trim()) {
+        return comment;
+      }
+    } catch (error) {
+      logger.warn('Error generating comment for issue:', error);
     }
     return null;
   });
   const commentResults = await Promise.all(commentPromises);
-  return commentResults.filter((comment): comment is Comment => comment !== null);
+  return commentResults.filter((c): c is Comment => c !== null);
 }

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 8

__

Why: This is a critical improvement for robustness, as it prevents a single error
in comment generation from causing the entire analysis pipeline to fail for all
other valid issues.

Medium
Possible issue
Ensure consistent return object structure

Add unifiedUsage, actualApiParams, and responseMetrics with undefined values to
the return object in the single-extractor case to ensure consistent telemetry.

internal-packages/ai/src/tools/fallacy-judge/index.ts [299-326]

 // If only one extractor, accept all issues (no aggregation needed)
 if (input.extractorIds.length === 1) {
   const acceptedDecisions = input.issues.map((issue, idx) => ({
     decision: 'accept' as const,
     finalText: issue.exactText,
     finalIssueType: issue.issueType,
     finalFallacyType: issue.fallacyType,
     finalSeverity: issue.severityScore,
     finalConfidence: issue.confidenceScore,
     finalImportance: issue.importanceScore,
     finalReasoning: issue.reasoning,
     sourceExtractors: [issue.extractorId],
     sourceIssueIndices: [idx],
     judgeReasoning: 'Single extractor mode - all issues accepted',
   }));
 
   return {
     acceptedDecisions,
     rejectedDecisions: [],
     summary: {
       totalInputIssues: input.issues.length,
       uniqueGroups: input.issues.length,
       acceptedCount: input.issues.length,
       mergedCount: 0,
       rejectedCount: 0,
     },
+    unifiedUsage: undefined,
+    actualApiParams: undefined,
+    responseMetrics: undefined,
   };
 }

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies that the return object for the single-extractor
case is missing telemetry fields, leading to inconsistent return types and incomplete
data.

Medium
Require text input for extraction

Add a .refine check to the Zod inputSchema to ensure that either text or
documentText is provided, preventing the tool from running without input.

internal-packages/ai/src/tools/fallacy-extractor/index.ts [88-102]

 const inputSchema = z.object({
   text: z.string().max(50000).optional().describe("Text chunk to analyze (optional if documentText provided)"),
   documentText: z.string().optional().describe("Full document text - used for analysis in single-pass mode, or for location finding in chunk mode"),
-  chunkStartOffset: z.number().min(0).optional().describe("Byte offset where this chunk starts in the full document (optimization for location finding)"),
-  model: z.string().optional().describe("Model to use (Claude or OpenRouter model ID)"),
-  temperature: z.union([
-    z.number().min(0).max(2),
-    z.literal('default'),
-  ]).optional().describe("Temperature for extraction (default: 0 for Claude, 0.1 for OpenRouter, 'default' to use model's native default)"),
-  thinking: z.boolean().optional().describe("Enable extended thinking/reasoning (default: true for Claude, varies for OpenRouter)"),
-  customSystemPrompt: z.string().optional().describe("Custom system prompt override"),
-  customUserPrompt: z.string().optional().describe("Custom user prompt override (document text appended)"),
-  minSeverityThreshold: z.number().min(0).max(100).optional().describe("Minimum severity threshold (default: 60)"),
-  maxIssues: z.number().min(1).max(100).optional().describe("Maximum issues to return (default: 15)"),
+  chunkStartOffset: z.number().min(0).optional().describe("Byte offset where this chunk starts in the full document"),
+  model: z.string().optional().describe("Model to use"),
+  temperature: z.union([z.number().min(0).max(2), z.literal('default')]).optional(),
+  thinking: z.boolean().optional(),
+  customSystemPrompt: z.string().optional(),
+  customUserPrompt: z.string().optional(),
+  minSeverityThreshold: z.number().min(0).max(100).optional(),
+  maxIssues: z.number().min(1).max(100).optional(),
+})
+.refine(data => !!(data.text || data.documentText), {
+  path: ['text'],
+  message: 'Either text or documentText must be provided',
 }) satisfies z.ZodType<FallacyExtractorInput>;

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 7

__

Why: The suggestion correctly points out that the tool could be called without any
text to analyze, and adding a Zod refine check provides robust, early validation.

Medium
Improve performance by caching Fuse.js instances

Improve performance in fuseSimilarity by caching and reusing Fuse.js instances
instead of creating a new one on each function call.

meta-evals/src/components/extractor-lab/fuzzy-dedup.ts [69-83]

+const fuseCache = new Map<string, Fuse<any>>();
+
+function getFuse(b: string): Fuse<any> {
+  if (!fuseCache.has(b)) {
+    const fuse = new Fuse([{ text: b }], {
+      keys: ["text"],
+      includeScore: true,
+      threshold: 1.0, // Accept all results, we'll check score ourselves
+      ignoreLocation: true,
+      minMatchCharLength: 2,
+    });
+    fuseCache.set(b, fuse);
+  }
+  return fuseCache.get(b)!;
+}
+
 export function fuseSimilarity(a: string, b: string): number {
-  const fuse = new Fuse([{ text: b }], {
-    keys: ["text"],
-    includeScore: true,
-    threshold: 1.0, // Accept all results, we'll check score ourselves
-    ignoreLocation: true,
-    minMatchCharLength: 2,
-  });
+  const fuse = getFuse(b);
 
   const results = fuse.search(a);
   if (results.length > 0 && results[0].score !== undefined) {
     return results[0].score;
   }
   return 1;
 }

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies a performance bottleneck by creating a new Fuse instance on every call and proposes a valid caching strategy, which significantly improves efficiency.

Medium
Handle object-based reasoning configuration

Update getClaudeThinkingConfig to correctly handle object-based reasoningEffort
configurations, such as { budget_tokens: ... }, to ensure explicit token budgets
are respected.

internal-packages/ai/src/tools/fallacy-extractor/index.ts [294-320]

 // For Anthropic models, convert reasoning effort to budget_tokens
 // Anthropic supports up to 128K thinking tokens
 const ANTHROPIC_MAX_THINKING_TOKENS = 128000;
 const EFFORT_PERCENTAGES: Record<string, number> = {
   minimal: 0.1,
   low: 0.3,
   medium: 0.5,
   high: 0.7,
   xhigh: 0.9,
 };
 
 // Calculate thinking config for Claude based on reasoning effort
 const getClaudeThinkingConfig = (): boolean | { type: 'enabled'; budget_tokens: number } => {
   if (!thinkingEnabled) return false;
 
   // Only set explicit budget if effort level is specified
   if (input.reasoningEffort && input.reasoningEffort !== 'none') {
-    const percentage = EFFORT_PERCENTAGES[input.reasoningEffort];
-    if (percentage) {
-      const budgetTokens = Math.floor(ANTHROPIC_MAX_THINKING_TOKENS * percentage);
-      return { type: 'enabled' as const, budget_tokens: budgetTokens };
+    if (typeof input.reasoningEffort === 'object' && 'budget_tokens' in input.reasoningEffort) {
+      return { type: 'enabled' as const, budget_tokens: input.reasoningEffort.budget_tokens };
+    }
+    if (typeof input.reasoningEffort === 'object' && 'effort' in input.reasoningEffort) {
+      const percentage = EFFORT_PERCENTAGES[input.reasoningEffort.effort];
+      if (percentage) {
+        const budgetTokens = Math.floor(ANTHROPIC_MAX_THINKING_TOKENS * percentage);
+        return { type: 'enabled' as const, budget_tokens: budgetTokens };
+      }
     }
   }
 
   // No effort specified - just return true, let wrapper use its default
   return true;
 };

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 6

__

Why: The suggestion correctly identifies that getClaudeThinkingConfig does not handle object-based
reasoningEffort configurations, which would cause explicit token budgets to be
ignored, impacting cost and performance.

Low
Delete operations in transaction

Wrap the two delete operations within a Prisma transaction to ensure atomicity
and prevent partial data deletion on failure.

internal-packages/db/src/repositories/MetaEvaluationRepository.ts [398-407]

 async deleteSeries(seriesId: string): Promise<void> {
-  // Delete runs first (foreign key constraint)
-  await this.prisma.seriesRun.deleteMany({
-    where: { seriesId },
-  });
-  // Delete the series
-  await this.prisma.series.delete({
-    where: { id: seriesId },
-  });
+  await this.prisma.$transaction([
+    this.prisma.seriesRun.deleteMany({
+      where: { seriesId },
+    }),
+    this.prisma.series.delete({
+      where: { id: seriesId },
+    }),
+  ]);
 }

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 6

__

Why: The suggestion correctly identifies that the two delete operations should be atomic and proposes using a transaction, which improves data integrity and robustness.

Low
Prevent division-by-zero error in similarity calculation

Prevent a division-by-zero error in calculateSimilarity by handling cases where
input strings are empty, ensuring the function returns a valid numeric score.

apps/web/src/app/api/monitor/lab/runs/[id]/finalize/route.ts [269-278]

 // Simple text similarity (Jaccard on words)
 function calculateSimilarity(a: string, b: string): number {
-  const wordsA = new Set(a.toLowerCase().split(/\s+/));
-  const wordsB = new Set(b.toLowerCase().split(/\s+/));
+  const wordsA = new Set(a.toLowerCase().split(/\s+/).filter(w => w.length > 0));
+  const wordsB = new Set(b.toLowerCase().split(/\s+/).filter(w => w.length > 0));
 
   const intersection = new Set([...wordsA].filter((x) => wordsB.has(x)));
   const union = new Set([...wordsA, ...wordsB]);
 
+  if (union.size === 0) {
+    return wordsA.size === 0 && wordsB.size === 0 ? 1 : 0;
+  }
+
   return intersection.size / union.size;
 }
Suggestion importance[1-10]: 6

__

Why: The suggestion correctly identifies a potential division-by-zero edge case and provides a robust fix to prevent NaN results, improving the function's reliability.

Low
Account for missing new snapshots

In the finalization logic, handle cases where a new snapshot is not found for a
baseline snapshot by logging a warning and appropriately updating the
changedCount.

apps/web/src/app/api/monitor/lab/runs/[id]/finalize/route.ts [95-100]

 for (const baselineSnapshot of baselineSnapshots) {
   const newSnapshot = newSnapshots.find(
     s => s.documentId === baselineSnapshot.documentId
   );
   if (newSnapshot) {
     // Compare comments...
+  } else {
+    changedCount++;
+    logger.warn(`No new snapshot for document ${baselineSnapshot.documentId}`);
   }
 }

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 5

__

Why: The suggestion correctly identifies a scenario where a baseline snapshot might not have a corresponding new snapshot, and proposes a reasonable way to handle and log this case.

Low

@michaelr524
Collaborator Author

Closing in favor of split PRs for CodeRabbit review (under 150 files each):

  • Main PR: #TBD (125 files - core changes)
  • Follow-up PR: tooling changes (38 files - dev/ and meta-evals/)

Full history preserved in fallacy-checker-refactor branch.

@michaelr524
Collaborator Author

Main PR created: #387
